WHAT IS CLAIMED IS:

1. A method for generating a language component vocabulary VC for a speech
recognition system having a language vocabulary V of a plurality of word
forms, the method comprising the steps of:
partitioning the language vocabulary V into subsets of word forms based on
frequencies of occurrence of the respective word forms; and
in at least one of said subsets, splitting word forms having frequencies less
than a threshold to thereby generate word form components.

2. The method of claim 1, wherein the frequencies of the word forms are
estimated from a given textual corpus.

3. The method of claim 1, wherein said partitioning step includes the sub-step
of numerating the plurality of word forms in the language vocabulary V in
descending order based on the frequencies associated with each of the
plurality of word forms.

4. The method of claim 1, wherein said partitioning step partitions the
language vocabulary V into at least two subsets S1 and S2, and said splitting
step splits the word forms of subset S2 into 2-tuple components including
stems and endings, but does not split the word forms of subset S1.

5. The method of claim 4, wherein said partitioning step further partitions
the language vocabulary V into a third subset S3, with word forms therein
being split in said splitting step into 3-tuple components including prefixes,
stems and endings.
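
By way of non-limiting illustration, the partitioning and splitting of claims 1 through 5 might be sketched as follows. The function names, the rank cut points `cut1` and `cut2` (standing in for frequency thresholds), and the toy prefix and ending lists are all assumptions made for the example; the claims do not specify them.

```python
from collections import Counter

# Hypothetical affix inventories; the claims do not enumerate any.
ENDINGS = ("ing", "ed", "s")
PREFIXES = ("un", "re")

def partition_by_rank(freqs, cut1, cut2):
    """Rank word forms by descending frequency and cut the ranking into S1, S2, S3."""
    ranked = sorted(freqs, key=lambda w: (-freqs[w], w))
    return ranked[:cut1], ranked[cut1:cut2], ranked[cut2:]

def split_stem_ending(word):
    """2-tuple split: (stem, ending), or the word unsplit if no ending matches."""
    for e in ENDINGS:
        if word.endswith(e) and len(word) > len(e):
            return (word[:-len(e)], e)
    return (word,)

def split_prefix_stem_ending(word):
    """3-tuple split: (prefix, stem, ending) when all three parts are present."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            rest = split_stem_ending(word[len(p):])
            if len(rest) == 2:
                return (p,) + rest
    return split_stem_ending(word)

def component_vocabulary(freqs, cut1, cut2):
    """Build a component vocabulary: S1 kept whole, S2 and S3 split."""
    s1, s2, s3 = partition_by_rank(freqs, cut1, cut2)
    vc = set(s1)
    for w in s2:
        vc.update(split_stem_ending(w))
    for w in s3:
        vc.update(split_prefix_stem_ending(w))
    return vc
```

For example, with frequencies `{"the": 100, "walked": 10, "walks": 9, "unlocked": 1}` and cuts 1 and 3, the most frequent word form stays whole while the rarer ones contribute shared components such as `walk`.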

6. The method of claim 1, wherein said splitting is performed subject to a
constraint in which a word that contains a given string of letters is
prevented from being split within the string if the string of letters
corresponds to one phoneme.
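
By way of non-limiting illustration, the constraint of claim 6 might be applied by listing the split positions that do not fall inside a protected letter string. The list `SINGLE_PHONEME_STRINGS` and the function name are assumptions for the example, and the sketch treats every occurrence of a listed string as a single phoneme, which is a simplification.

```python
# Hypothetical letter strings that each spell one phoneme.
SINGLE_PHONEME_STRINGS = ("sh", "ch", "th", "ph")

def allowed_split_points(word, protected=SINGLE_PHONEME_STRINGS):
    """Return indices i such that splitting into word[:i] and word[i:] is allowed."""
    blocked = set()
    for s in protected:
        start = word.find(s)
        while start != -1:
            # Indices strictly inside this occurrence would cut the string apart.
            blocked.update(range(start + 1, start + len(s)))
            start = word.find(s, start + 1)
    return [i for i in range(1, len(word)) if i not in blocked]
```

Under these assumptions, `allowed_split_points("washed")` excludes index 3, so the word may be split as `wash` + `ed` but not as `was` + `hed`.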

7. The method of claim 1, wherein said splitting is performed using a fixed
vocabulary and a fixed list of allowable endings, with each word from the
fixed vocabulary being split into at least a stem and an ending that is an
element of the fixed set of endings, so as to substantially minimize the total
number of all stems that are required to split every word in the fixed
vocabulary.

8. The method of claim 7, wherein the fixed set of allowable endings includes
an empty ending.

9. The method of claim 1, further comprising generating and storing a word
form to corresponding word form components table.

10. The method of claim 9, further comprising the step of labeling each of the
word form components stored in said table to distinguish between stems,
prefixes and endings.

11. The method of claim 1, further comprising the steps of:
generating a map of said word forms to said word form components, said map
further including each of a plurality of non-split words as being associated
with itself;
filtering a textual corpus using the map to generate a textual component
corpus containing the non-split word forms and the word form components of the
map;
accumulating the word form components and the non-split word forms generated
by said filtering step in an n-gram language model; and
determining counts of n-tuple sets of word form components and word forms to
estimate n-gram probabilities for the n-gram language model.

12. The method of claim 1, wherein said filtering step maps every word in the
corpus into a n-tuple word form component.
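
By way of non-limiting illustration, the filtering and counting steps of claim 11 might be sketched as follows. The function names are hypothetical, and the unsmoothed maximum-likelihood estimate is an assumption chosen for brevity; the claim only requires that counts be used to estimate n-gram probabilities.

```python
from collections import Counter

def filter_corpus(tokens, split_map):
    """Replace each token by its components; non-split words map to themselves."""
    out = []
    for tok in tokens:
        out.extend(split_map.get(tok, (tok,)))
    return out

def ngram_counts(units, n):
    """Count n-tuples of consecutive units in the component corpus."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))

def bigram_probability(units, first, second):
    """Maximum-likelihood estimate of P(second | first), with no smoothing."""
    bigrams = ngram_counts(units, 2)
    # Count `first` only where it has a successor, so the estimates sum to 1.
    history = sum(c for (a, _), c in bigrams.items() if a == first)
    return bigrams[(first, second)] / history if history else 0.0
```

For instance, filtering `["he", "walked", "he", "walks"]` with a map that splits `walked` and `walks` yields the component corpus `he walk ed he walk s`, from which counts for pairs such as (`walk`, `ed`) are taken.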



13. A method for use in speech recognition, comprising the steps of:
splitting an acoustic vocabulary comprising baseforms into baseform components
and storing said baseform components; and
performing sound to spelling mapping on said baseform components so as to
generate a baseform components to word parts table for use in subsequent
decoding of speech.

14. The method of claim 13, wherein said acoustic vocabulary is generated from
a textual corpus by applying sound to spelling mapping to said textual corpus,
and said method further comprises generating a language model vocabulary from
said textual corpus.

15. The method of claim 14, further comprising partitioning said 
language model vocabulary and splitting said partitioned language vocabulary into 
vocabulary components. 

16. The method of claim 15, wherein said steps of splitting said acoustic
vocabulary and splitting said partitioned language vocabulary are performed
using the same splitting criteria.

17. The method of claim 13, wherein said splitting comprises splitting
baseforms of average size lengths into a first number of components and
splitting baseforms of relatively longer lengths into a larger number of
components.

18. The method of claim 13 wherein said baseform components are generated
independently from language model components.

19. The method of claim 13, further comprising:
performing spelling to sound mapping which includes applying a predetermined
set of rules to each word in a word string of a textual corpus, with
pronunciations of words being obtained from a word to baseform table; and
baseforms stored in said word to baseform table are collected in said acoustic
vocabulary.

20. The method of claim 19, further comprising making entries in said baseform
components to word parts table by applying spelling to sound mapping to
strings of components, said strings of components being obtained by filtering
words of said textual corpus.

21. The method of claim 19, further comprising applying said rules to a
language model vocabulary so as to produce new word/baseform pairs in said
word to baseform table.

22. The method of claim 19 wherein said sound to spelling mapping is performed
via an inversion of said set of rules.

23. The method of claim 22 wherein said sound to spelling mapping produces
said baseform components to word parts table by utilizing data from said

24. The method of claim 13, wherein said splitting is performed subject to a
constraint in which a word that contains a given string of letters is
prevented from being split within the string if the string of letters
corresponds to one phoneme.



25. The method of claim 13, wherein said splitting is performed using a sorted
and fixed vocabulary and a fixed list of allowable endings including an empty
ending, with each word from the fixed vocabulary being split into a stem and
an ending that is an element of the fixed set of endings, so as to
substantially minimize the total number of all stems that are required to
split every word in the fixed vocabulary.



26. A method for decoding a speech utterance using language model components
and acoustic components, comprising the steps of:
(a) generating from said utterance a stack of baseform component paths;
(b) concatenating baseform components in a path to generate concatenated
baseforms, when the concatenated baseform components correspond to a baseform
found in an acoustic vocabulary;
(c) mapping said concatenated baseforms into words;
(d) computing language model (LM) scores associated with said words using a
language model, and performing further decoding of said utterance based
thereupon.



27. The method of claim 26, wherein said step (d) includes:
mapping said words into a string of sub-words;
computing said LM scores for said sub-words; and
attaching said LM scores to words that produced the corresponding strings of
sub-words and performing said further decoding based thereupon.



28. The method of claim 26, wherein said step (a) includes the sub-steps of
producing, from said utterance, a set of baseform component strings, and
generating said stack of baseform component paths from said strings.

29. The method of claim 26, further comprising the steps of:
(e) mapping the baseform components of said path into word parts, when the
concatenated baseform components thereof do not form a baseform found in the
acoustic vocabulary;
(f) generating a LM score for an n-tuple of said word parts;
(g) designating a concatenated word form as a valid word, if the LM score for
the n-tuple of word parts exceeds a specific threshold, and adding the valid
word to a word stack for further decoding.
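
By way of non-limiting illustration, the handling of a single baseform component path under steps (b), (c) and (e) through (g) might be sketched as follows. Representing baseforms as space-separated phone strings, the lookup tables, and the scoring callable `lm_score` are all assumptions for the example; step (d) and the generation of the path stack in step (a) are omitted.

```python
def decode_path(path, acoustic_vocab, baseform_to_word,
                component_to_part, lm_score, threshold):
    """Turn one baseform component path into a word, or None if rejected.

    path: tuple of baseform components, e.g. ("W AO K", "T")
    acoustic_vocab: set of full baseforms
    baseform_to_word: dict mapping a full baseform to a word
    component_to_part: dict mapping a baseform component to a word part
    lm_score: callable returning a score for a tuple of word parts
    """
    baseform = " ".join(path)                           # step (b)
    if baseform in acoustic_vocab:
        return baseform_to_word[baseform]               # step (c)
    parts = tuple(component_to_part[c] for c in path)   # step (e)
    if lm_score(parts) > threshold:                     # steps (f) and (g)
        return "".join(parts)
    return None
```

In this sketch a path whose concatenation is a known baseform is mapped directly to its word, while an unknown concatenation is accepted as a new word only if the language model score of its word parts exceeds the threshold.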



30. The method of claim [illegible], further comprising producing an N-best
list containing a set of N sentences having the highest likelihood scores,
said likelihood scores being a measure of how well candidate sentences match
acoustic data.



31. The method of claim 26, wherein said language model comprises a mixture of
arithmetic and linguistic language model components.



32. The method of claim 31 wherein said language model is configured to split
word numbers into smaller numbers via a modular arithmetic method.

33. The method of claim 31, wherein said words comprise k-tuple word forms,
and said method further comprises:
mapping said k-tuple word forms into word numbers;
splitting said word numbers into a t-tuple of integers;
if a probability score for said t-tuples of integers exceeds a predetermined
threshold, decoding said t-tuple of integers; else splitting said k-tuple of
words into L-tuple linguistic components, computing probability scores
therefor, and performing decoding based thereon.
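
By way of non-limiting illustration, one possible reading of the modular arithmetic splitting of claims 32 and 33 is a fixed-base decomposition of each word number. The claims do not define the method, so the base, the tuple length `t`, and both function names are assumptions for the example.

```python
def split_number(n, base, t):
    """Split word number n into t integers in range(base), most significant first."""
    if not 0 <= n < base ** t:
        raise ValueError("n does not fit in t digits of the given base")
    digits = []
    for _ in range(t):
        n, r = divmod(n, base)  # r is the remainder modulo base
        digits.append(r)
    return tuple(reversed(digits))

def join_number(digits, base):
    """Inverse of split_number: recombine the integers into the word number."""
    n = 0
    for d in digits:
        n = n * base + d
    return n
```

With a base of 1000 and `t = 2`, word number 12345 splits into the pair (12, 345), so a vocabulary of up to one million word numbers is covered by integers below 1000.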



34. The method of claim 31, further comprising splitting said words via
linguistic splitting based on morphemes.




35. The method of claim 26, further comprising splitting said words via
linguistic splitting based on morphemes.

36. The method of claim 26, further comprising splitting said words 
via linguistic splitting based on any one of spelling, phones and morphemes. 



37. The method of claim 33 wherein said splitting said k-tuple of words into
L-tuple linguistic components comprises linguistic splitting based on any one
of spelling, phones and morphemes.

38. The method of claim 26 wherein said LM scores are computed using a
smoothing process for linguistic components, said smoothing process
comprising:
verifying whether first and second candidate stems of one of said words have
the same set of possible endings by comparing stored ending lists for the
respective stems with one another.



39. The method of claim 38 wherein said verifying comprises:
counting a number of times each of the endings in a first said ending list
associated with said first stem follows the first stem;
counting the number of times each of the endings in a second said ending list
associated with said second stem follows said second stem; and
processing counts resulting from said counting in accordance with a
predetermined set of conditions, with probabilities for endings being set if
said conditions are satisfied.

40. The method of claim 39 wherein said set of conditions comprises:
said first stem must have high counts for all possible endings that follow it
and said second stem must have low counts for at least some endings that
follow said second stem;
wherein if said set of conditions is satisfied, then the probabilities for
endings following said second stem are set as probabilities for these endings
to follow said first stem.
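
By way of non-limiting illustration, the smoothing of claims 38 through 40 might be sketched as follows. The count thresholds `high` and `low`, the data layout, and the use of relative frequencies as the probabilities are assumptions for the example.

```python
def smooth_ending_probs(counts, endings, stem1, stem2, high, low):
    """Return P(ending | stem2), borrowed from stem1 when the conditions hold.

    counts: dict mapping (stem, ending) to an observed count
    endings: dict mapping a stem to its set of allowable endings
    high, low: hypothetical thresholds for "high" and "low" counts
    """
    def relative(stem):
        total = sum(counts.get((stem, e), 0) for e in endings[stem])
        return {e: counts.get((stem, e), 0) / total if total else 0.0
                for e in endings[stem]}

    # Compare the stored ending lists of the two stems.
    same_endings = endings[stem1] == endings[stem2]
    # Count how often each ending follows each stem.
    c1 = [counts.get((stem1, e), 0) for e in endings[stem1]]
    c2 = [counts.get((stem2, e), 0) for e in endings[stem2]]
    # Well-observed first stem, sparsely observed second stem.
    if same_endings and all(c >= high for c in c1) and any(c <= low for c in c2):
        return relative(stem1)
    return relative(stem2)
```

For example, if `walk` is seen 60 times with `s` and 40 times with `ed`, while `stalk` shares the same ending list but is seen only once, the sketch assigns `stalk` the ending probabilities 0.6 and 0.4 estimated for `walk`.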




41. The method of claim 39 wherein said set of conditions comprises:
both said first and second stems must belong to a particular class.

42. A method for splitting words in a language vocabulary V in an automatic
speech recognition system to provide vocabulary compression, wherein the
vocabulary V has a fixed size, the method comprising the steps of:
(a) providing a fixed set of allowable endings, including an empty ending;
(b) providing a fixed set of constraints for splitting words into stems;
(c) initializing a split map of words and the corresponding stems and endings
by setting a variable t to a predetermined value, and selecting a first word
from the fixed vocabulary;
(d) randomly splitting the first word to generate an ending from the fixed
list of allowable endings and a stem;
(e) defining and storing a stem set containing the stem generated at said
splitting step (d) and a word set containing the first word;
(f) determining whether t is less than the size of the vocabulary V;
(g) obtaining a new word from the vocabulary V, when t is less than the size
of the vocabulary V;
(h) determining possible splits for the new word to generate stems and endings
therefrom, using the fixed set of allowable endings and the fixed set of
constraints;
(i) determining whether there is a split for the new word that generates a
previously stored stem of the stem set;
(j) splitting the current word into the previously stored stem and an ending
of the set of allowable endings, when there is a split for the new word that
generates the previously stored stem of the stem set;
(k) determining whether another previously stored stem in the stem set can be
replaced by a new stem generated at step (h), when there is no split for the
current word that generates the previously stored stem of the stem set;
(l) redefining the stem set and the split map to include the new stem
generated at step (h) in place of the other previously stored stem, when the
other previously stored stem can be replaced by the new stem generated at
step (h);
(m) redefining the stem set to include any new stem into which the current
word may be split and extending the split map to include the current word by
splitting the new word into the new stem, when the other previously stored
stem in the stem set cannot be replaced by the new stem generated at step (h);
and
(n) incrementing t and returning to step (f) if t is less than the size of the
vocabulary V.
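
By way of non-limiting illustration, steps (c) through (n) might be sketched as follows. The function names, the minimal stem length standing in for the fixed set of constraints, the seeded random generator, and the choice of the shortest stem in step (m) are assumptions for the example; the replacement test in step (k) requires every word already split on the old stem to remain splittable with the new stem.

```python
import random

def possible_splits(word, endings, min_stem_len=1):
    """Step (h): all (stem, ending) pairs allowed by the ending list."""
    return [(word[:len(word) - len(e)], e) for e in endings
            if word.endswith(e) and len(word) - len(e) >= min_stem_len]

def find_replacement(candidates, stems, split_map, endings):
    """Step (k): find an old stem whose words can all be re-split on a new stem."""
    for new_stem, new_ending in candidates:
        for old in sorted(stems):
            users = [w for w, (s, _) in split_map.items() if s == old]
            if all(w.startswith(new_stem) and w[len(new_stem):] in endings
                   for w in users):
                return old, (new_stem, new_ending)
    return None

def build_split_map(vocabulary, endings, rng=None):
    """Return (split_map, stems) for a vocabulary given as a list of words."""
    rng = rng or random.Random(0)
    first = vocabulary[0]
    split_map = {first: rng.choice(possible_splits(first, endings))}  # step (d)
    stems = {split_map[first][0]}                                     # step (e)
    for word in vocabulary[1:]:                       # steps (f), (g), (n)
        candidates = possible_splits(word, endings)                   # step (h)
        known = [c for c in candidates if c[0] in stems]              # step (i)
        if known:
            split_map[word] = known[0]                                # step (j)
            continue
        found = find_replacement(candidates, stems, split_map, endings)
        if found:                                                     # step (l)
            old, (new_stem, new_ending) = found
            for w, (s, _) in list(split_map.items()):
                if s == old:
                    split_map[w] = (new_stem, w[len(new_stem):])
            stems.remove(old)
            stems.add(new_stem)
            split_map[word] = (new_stem, new_ending)
        else:                                                         # step (m)
            split_map[word] = min(candidates, key=lambda c: len(c[0]))
            stems.add(split_map[word][0])
    return split_map, stems
```

For the vocabulary `["cats", "cat"]` with endings `("", "s", "ed")`, both possible random splits of `cats` lead to the single stem `cat`: either `cat` is already stored, or the stored stem `cats` is replaced by `cat` and `cats` is re-split as `cat` + `s`.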



43. The method of claim 42, further comprising the step of terminating the
method if t is not less than the size of the fixed vocabulary.

44. The method of claim 42, wherein said determining step (k) comprises the
step of determining whether other words stored in the word set during previous
iterations will remain split after such substitution.



45. The method of claim 42, wherein the vocabulary is sorted such that the
words in the language vocabulary V are numerated in descending order based on
frequencies associated with each of the words.

46. The method of claim 42, wherein step (j) further comprises the step of
extending the split map to the new word.

47. The method of claim 42, wherein step (i) generates all possible splits for
the new word.